The Clínica de los Alpes wants to know which main causes of death reduce the life expectancy of its country's citizens, so that it can run awareness campaigns aimed at improving their quality of life. The clinic has a dataset of life expectancy over the years in the Alps and nearby countries, together with several indicators of a person's health, such as body mass index, the incidence of various diseases, and sociocultural factors such as alcohol or tobacco consumption. They want to use this information to build a model that helps with the following tasks:
- Identify the variables that most affect the life expectancy of people in the Alps.
- Predict life expectancy in the Alps from the variables of interest.
seed = 161
import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport
# Pipeline composition
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures, Normalizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MaxAbsScaler
from sklearn.model_selection import train_test_split
# Linear regression
from sklearn.linear_model import LinearRegression
# Import/export models
from joblib import dump, load
from itertools import combinations
# Metrics
from sklearn.metrics import mean_squared_error as mse
# Q-Q plots
import scipy.stats as stats
# Data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette("Paired")
data_or = pd.read_csv('datos.csv', na_values="NA-VALUE")
print(data_or.shape)
data_or.head(5)
(294, 19)
| | Expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 10-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2644 | 151.0 | 0 | 1.80 | 423.295351 | 9.0 | 0 | 68.6 | 0 | 91.0 | 4.87 | 9.0 | 0.1 | 2284.378580 | 146.0 | 0.1 | 0.1 | 0.693 | 14.6 |
| 1 | 2645 | 153.0 | 0 | 1.79 | 45.851058 | 85.0 | 0 | 67.8 | 0 | 91.0 | 5.90 | 9.0 | 0.1 | 229.714718 | 99789.0 | 0.1 | 0.1 | 0.683 | 13.7 |
| 2 | 2646 | 155.0 | 0 | 1.51 | 310.820338 | 88.0 | 0 | 67.0 | 0 | 85.0 | 5.30 | 84.0 | 0.1 | 1842.444210 | 99184.0 | 0.1 | 0.1 | 0.679 | 13.5 |
| 3 | 2647 | 157.0 | 0 | 1.35 | 330.100739 | 91.0 | 4 | 66.2 | 0 | 91.0 | 5.66 | 89.0 | 0.1 | 1837.977391 | 98611.0 | 0.1 | 0.1 | 0.674 | 13.2 |
| 4 | 2648 | 158.0 | 0 | 1.24 | 40.491289 | 93.0 | 0 | 65.5 | 0 | 91.0 | 4.75 | 91.0 | 0.1 | 263.272360 | 9882.0 | 0.1 | 0.1 | 0.676 | 13.7 |
sns.pairplot(data_or, height=3, y_vars='Expectancy', x_vars=data_or.columns[1:7])
sns.pairplot(data_or, height=3, y_vars='Expectancy', x_vars=data_or.columns[7:13])
sns.pairplot(data_or, height=3, y_vars='Expectancy', x_vars=data_or.columns[13:19])
# Drop rows that contain at least one missing value
data_or = data_or.dropna(axis=0)
# Disabled: drop rows that have a 0 in any of the columns listed below.
# zero_check_cols = [
#     'Adult Mortality', 'infant deaths', 'Alcohol', 'percentage expenditure',
#     'Hepatitis B', 'Measles', 'BMI', 'under-five deaths', 'Polio',
#     'Total expenditure', 'Diphtheria', 'HIV/AIDS', 'GDP',
#     'thinness 10-19 years', 'thinness 5-9 years',
#     'Income composition of resources', 'Schooling',
# ]
# data_or = data_or[(data_or[zero_check_cols] != 0).all(axis=1)]
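Before dropping rows it can help to see how much is actually missing. The following is a small sketch on made-up CSV text (the column names `a` and `b` and all values are invented for illustration). It shows how `na_values` turns a sentinel string into NaN and how `isna().sum()` counts the missing entries per column before `dropna` is applied.

```python
import io

import pandas as pd

# Made-up CSV text; "NA-VALUE" plays the role of the sentinel passed to read_csv.
csv_text = "a,b\n1,NA-VALUE\n2,5\nNA-VALUE,6\n3,7\n"
demo = pd.read_csv(io.StringIO(csv_text), na_values="NA-VALUE")

missing_per_column = demo.isna().sum()   # NaN count in each column
rows_before = len(demo)
demo_clean = demo.dropna(axis=0)         # drop rows with at least one NaN
rows_dropped = rows_before - len(demo_clean)

print(missing_per_column)
print(f"dropped {rows_dropped} of {rows_before} rows")
```

Comparing the row count before and after makes it obvious when `dropna` removes nothing, which would mean that any placeholder for missing data is encoded some other way.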
data_or.describe()
| | Expectancy | Adult Mortality | infant deaths | Alcohol | percentage expenditure | Hepatitis B | Measles | BMI | under-five deaths | Polio | Total expenditure | Diphtheria | HIV/AIDS | GDP | Population | thinness 10-19 years | thinness 5-9 years | Income composition of resources | Schooling |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 294.000000 | 2.940000e+02 | 294.000000 | 294.000000 | 294.000000 | 294.000000 |
| mean | 2790.500000 | 180.156463 | 22.748299 | 4.031327 | 250.691789 | 67.258503 | 2299.707483 | 39.811565 | 31.921769 | 82.459184 | 5.934014 | 80.578231 | 2.866327 | 2888.804225 | 4.541904e+06 | 5.141497 | 5.135374 | 0.492966 | 9.931293 |
| std | 85.014705 | 149.969676 | 28.065706 | 3.411991 | 636.324313 | 35.669719 | 6887.681389 | 20.323780 | 43.125549 | 21.932024 | 3.285364 | 24.922111 | 6.876873 | 7269.383426 | 1.293499e+07 | 4.007686 | 4.123723 | 0.289985 | 4.827973 |
| min | 2644.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 7.000000 | 0.000000 | 5.000000 | 0.100000 | 0.000000 | 0.000000e+00 | 0.100000 | 0.100000 | 0.000000 | 0.000000 |
| 25% | 2717.250000 | 82.000000 | 3.000000 | 1.375000 | 0.000000 | 53.000000 | 0.000000 | 19.300000 | 3.000000 | 75.000000 | 4.400000 | 75.000000 | 0.100000 | 0.000000 | 0.000000e+00 | 1.625000 | 1.600000 | 0.415000 | 9.725000 |
| 50% | 2790.500000 | 153.000000 | 10.000000 | 2.720000 | 27.137321 | 83.000000 | 55.500000 | 43.000000 | 12.000000 | 92.000000 | 5.405000 | 92.000000 | 0.100000 | 430.824070 | 1.055190e+05 | 5.050000 | 4.900000 | 0.600500 | 11.100000 |
| 75% | 2863.750000 | 231.000000 | 29.000000 | 6.632500 | 194.536691 | 94.000000 | 816.500000 | 57.475000 | 42.000000 | 96.000000 | 7.075000 | 96.000000 | 0.775000 | 2244.678564 | 2.482152e+06 | 6.675000 | 6.700000 | 0.715750 | 12.975000 |
| max | 2937.000000 | 723.000000 | 116.000000 | 12.220000 | 4003.908598 | 99.000000 | 49871.000000 | 79.300000 | 191.000000 | 99.000000 | 17.600000 | 99.000000 | 43.500000 | 45758.955400 | 7.827147e+07 | 15.800000 | 16.400000 | 0.836000 | 15.700000 |
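Two patterns in the summary above may be worth checking programmatically. Several columns have a minimum, and in some cases a 25th percentile, of exactly 0. Also, `Expectancy` runs from 2644 to 2937 with a mean of 2790.5 over 294 rows, which is consistent with a run of consecutive integers. Whether that column really holds a row identifier cannot be decided from the summary alone. Below is a minimal sketch of both checks on a small invented frame rather than the real data; `zero_share` and `looks_like_row_id` are hypothetical helper names.

```python
import pandas as pd


def zero_share(df: pd.DataFrame) -> pd.Series:
    """Fraction of rows that are exactly 0 in each column, largest first."""
    return (df == 0).mean().sort_values(ascending=False)


def looks_like_row_id(col: pd.Series) -> bool:
    """True if the values are unique integers covering a gap-free range."""
    values = col.dropna()
    if values.empty or not (values % 1 == 0).all():
        return False
    span = values.max() - values.min() + 1
    return bool(values.is_unique and span == len(values))


# Invented demo data (NOT the clinic's dataset).
demo = pd.DataFrame({
    "id_like": [2644, 2645, 2646, 2647],
    "gdp": [0.0, 0.0, 430.8, 2244.7],
    "schooling": [14.6, 13.7, 0.0, 13.2],
})

zeros = zero_share(demo)
id_flags = {name: looks_like_row_id(demo[name]) for name in demo.columns}
print(zeros)
print(id_flags)
```

A high share of zeros in a quantity such as GDP or population would suggest that 0 is standing in for a missing value, which `dropna` does not catch.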
# Keep only rows where every column lies within 3 standard deviations of its mean
restr = data_or.apply(lambda x: np.abs(stats.zscore(x)) < 3).all(axis=1)
data_cl = data_or.loc[restr].copy()
data_cl.shape
(230, 19)
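The filter above keeps a row only if every column lies within three standard deviations of its column mean. That is why 294 rows shrink to 230: a single extreme value in any of the 19 columns removes the whole row. Below is a minimal sketch of the same rule on invented data, assuming `scipy.stats.zscore` with its default population standard deviation (`ddof=0`); the column names `a` and `b` are made up.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Invented demo data: 51 rows, with an extreme value planted in column "b".
rng = np.random.default_rng(161)
demo = pd.DataFrame({
    "a": rng.normal(loc=10.0, scale=1.0, size=51),
    "b": rng.normal(loc=0.0, scale=1.0, size=51),
})
demo.loc[50, "b"] = 100.0  # planted outlier

z = demo.apply(stats.zscore)          # column-wise z-scores
keep = (z.abs() < 3).all(axis=1)      # a row must pass in every column
demo_clean = demo.loc[keep]

print(f"{len(demo)} rows before, {len(demo_clean)} after")
print("planted outlier row kept:", bool(keep.loc[50]))
```

Because the rule is applied to all columns at once, the number of discarded rows grows with the number of columns. Inspecting `(z.abs() >= 3).sum()` per column shows which variables are responsible for most of the removals.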
sns.pairplot(data_cl, height=3, y_vars='Expectancy', x_vars=data_cl.columns[1:7])
sns.pairplot(data_cl, height=3, y_vars='Expectancy', x_vars=data_cl.columns[7:13])
sns.pairplot(data_cl, height=3, y_vars='Expectancy', x_vars=data_cl.columns[13:19])
ProfileReport(data_cl)